Abstract

This paper describes a method for linear text segmentation which is twice as accurate and over seven times as fast as the state-of-the-art (Reynar, 1998). Inter-sentence similarity is replaced by rank in the local context. Boundary locations are discovered by divisive clustering.

1 Introduction

Even moderately long documents typically address several topics or different aspects of the same topic. The aim of linear text segmentation is to discover the topic boundaries. The uses of this procedure include information retrieval (Hearst and Plaunt, 1993; Hearst, 1994; Yaari, 1997; Reynar, 1999), summarisation (Reynar, 1998), text understanding, anaphora resolution (Kozima, 1993), language modelling (Morris and Hirst, 1991; Beeferman et al., 1997b) and improving document navigation for the visually disabled (Choi, 2000).

This paper focuses on domain independent methods for segmenting written text. We present a new algorithm that builds on previous work by Reynar (Reynar, 1998; Reynar, 1994). The primary distinction of our method is the use of a ranking scheme and the cosine similarity measure (van Rijsbergen, 1979) in formulating the similarity matrix. We propose that the similarity values of short text segments are statistically insignificant. Thus, one can only rely on their order, or rank, for clustering.

2 Background

Existing work falls into one of two categories, lexical cohesion methods and multi-source methods (Yaari, 1997). The former stem from the work of Halliday and Hasan (Halliday and Hasan, 1976). They proposed that text segments with similar vocabulary are likely to be part of a coherent topic segment. Implementations of this idea use word stem repetition (Youmans, 1991; Reynar, 1994; Ponte and Croft, 1997), context vectors (Hearst, 1994; Yaari, 1997; Kaufmann, 1999; Eichmann et al., 1999), entity repetition (Kan et al., 1998), semantic similarity (Morris and Hirst, 1991; Kozima, 1993), word distance model (Beeferman et al., 1997a) and word frequency model (Reynar, 1999) to detect cohesion. Methods for finding the topic boundaries include sliding window (Hearst, 1994), lexical chains (Morris, 1988; Kan et al., 1998), dynamic programming (Ponte and Croft, 1997; Heinonen, 1998), agglomerative clustering (Yaari, 1997) and divisive clustering (Reynar, 1994). Lexical cohesion methods are typically used for segmenting written text in a collection to improve information retrieval (Hearst, 1994; Reynar, 1998).

Multi-source methods combine lexical cohesion with other indicators of topic shift such as cue phrases, prosodic features, reference, syntax and lexical attraction (Beeferman et al., 1997a) using decision trees (Miike et al., 1994; Kurohashi and Nagao, 1994; Litman and Passonneau, 1995) and probabilistic models (Beeferman et al., 1997b; Hajime et al., 1998; Reynar, 1998). Work in this area is largely motivated by the topic detection and tracking (TDT) initiative (Allan et al., 1998). The focus is on the segmentation of transcribed spoken text and broadcast news stories where the presentation format and regular cues can be exploited to improve accuracy.

3 Algorithm

Our segmentation algorithm takes a list of tokenized sentences as input. A tokenizer (Grefenstette and Tapanainen, 1994) and a sentence boundary disambiguation algorithm (Palmer and Hearst, 1994; Reynar and Ratnaparkhi, 1997) or EAGLE (Reynar et al., 1997) may be used to convert a plain text document into the acceptable input format.

3.1 Similarity measure

Punctuation and uninformative words are removed from each sentence using a simple regular expression pattern matcher and a stopword list. A stemming algorithm (Porter, 1980) is then applied to the remaining tokens to obtain the word stems. A dictionary of word stem frequencies is constructed for each sentence. This is represented as a vector of frequency counts.

Let $f_{i,j}$ denote the frequency of word j in sentence i. The similarity between a pair of sentences x, y is computed using the cosine measure as shown in equation 1. This is applied to all sentence pairs to generate a similarity matrix.

\[ \mathrm{sim}(x, y) = \frac{\sum_j f_{x,j} \times f_{y,j}}{\sqrt{\sum_j f_{x,j}^2 \times \sum_j f_{y,j}^2}} \tag{1} \]
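
The construction of the similarity matrix can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names `cosine` and `similarity_matrix` are hypothetical, sentences are assumed to arrive already stopword-filtered and stemmed, and a sentence with no informative stems is assumed to have similarity 0 with everything.

```python
import math
from collections import Counter


def cosine(fx, fy):
    """Cosine measure of equation 1 for two stem-frequency dictionaries."""
    numerator = sum(count * fy.get(stem, 0) for stem, count in fx.items())
    denominator = math.sqrt(
        sum(c * c for c in fx.values()) * sum(c * c for c in fy.values())
    )
    # Assumption: an empty sentence is treated as dissimilar to everything.
    return numerator / denominator if denominator else 0.0


def similarity_matrix(sentences):
    """sentences: list of lists of word stems -> n x n list of cosine values."""
    vectors = [Counter(stems) for stems in sentences]
    return [[cosine(a, b) for b in vectors] for a in vectors]
```

Sentences that share every stem score 1 and sentences that share none score 0, so cohesive runs of sentences appear as bright squares along the diagonal.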



Figure 1 shows an example of a similarity matrix (footnote 1). High similarity values are represented by bright pixels. The bottom-left and top-right pixels show the self-similarity for the first and last sentence, respectively. Notice the matrix is symmetric and contains bright square regions along the diagonal. These regions represent cohesive text segments.

Figure 1: An example similarity matrix.

3.2 Ranking

For short text segments, the absolute value of sim(x, y) is unreliable. An additional occurrence of a common word (reflected in the numerator) causes a disproportionate increase in sim(x, y) unless the denominator (related to segment length) is large. Thus, in the context of text segmentation where a segment has typically < 100 informative tokens, one can only use the metric to estimate the order of similarity between sentences, e.g. a is more similar to b than c.

Furthermore, language usage varies throughout a document. For instance, the introduction section of a document is less cohesive than a section which is about a particular topic. Consequently, it is inappropriate to directly compare the similarity values from different regions of the similarity matrix.

In non-parametric statistical analysis, one compares the rank of data sets when the qualitative behaviour is similar but the absolute quantities are unreliable. We present a ranking scheme which is an adaptation of that described in (O'Neil and Denos, 1992).

Each value in the similarity matrix is replaced by its rank in the local region. The rank is the number of neighbouring elements with a lower similarity value. Figure 2 shows an example of image ranking using a 3 x 3 rank mask with output range {0, 8}. For segmentation, we used an 11 x 11 rank mask. The output is expressed as a ratio r (equation 2) to circumvent normalisation problems (consider the cases when the rank mask is not contained in the image).

\[ r = \frac{\#\ \text{of elements with a lower value}}{\#\ \text{of elements examined}} \tag{2} \]

Figure 2: A working example of image ranking.

To demonstrate the effect of image ranking, the process was applied to the matrix shown in figure 1 to produce figure 3 (footnote 2). Notice the contrast has been improved significantly. Figure 4 illustrates the more subtle effects of our ranking scheme. r(x) is the rank (1 x 11 mask) of f(x) which is a sine wave with decaying mean, amplitude and frequency (equation 3).

\[ f(x) = g(x \times \ldots), \qquad g(z) = \tfrac{1}{2}\left(e^{-z/2} + \tfrac{1}{2} e^{-z/2}\left(1 + \sin(10 z^{0.7})\right)\right) \tag{3} \]

Figure 3: The matrix in figure 1 after ranking.

Figure 4: An illustration of the more subtle effects of our ranking scheme.

1 The contrast of the image has been adjusted to highlight the image features.

2 The process was applied to the original matrix, prior to contrast enhancement. The output image has not been enhanced.
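
The ranking step of section 3.2 can be sketched as follows. This is an illustrative reading rather than the author's code: the name `rank_matrix` is hypothetical, and the sketch assumes that the centre element is not counted among the elements examined (consistent with a 3 x 3 mask giving ranks in the range 0 to 8) and that the mask is simply clipped at the edge of the matrix.

```python
def rank_matrix(sim, mask=11):
    """Replace each similarity value by the proportion of its neighbours
    (inside a mask x mask window) that have a strictly lower value."""
    n = len(sim)
    half = mask // 2
    ranked = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            lower = examined = 0
            # Clip the window so it never leaves the matrix.
            for p in range(max(0, i - half), min(n, i + half + 1)):
                for q in range(max(0, j - half), min(n, j + half + 1)):
                    if p == i and q == j:
                        continue  # assumption: the centre is not its own neighbour
                    examined += 1
                    if sim[p][q] < sim[i][j]:
                        lower += 1
            # Ratio form of equation 2; a 1 x 1 mask examines nothing.
            ranked[i][j] = lower / examined if examined else 0.0
    return ranked
```

Expressing the rank as a ratio means that elements near the border, where fewer neighbours are available, remain comparable with elements in the interior.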

3.3 Clustering 

The final process determines the location of the topic boundaries. The method is based on Reynar's maximisation algorithm (Reynar, 1998; Helfman, 1996; Church, 1993; Church and Helfman, 1993). A text segment is defined by two sentences i, j (inclusive). This is represented as a square region along the diagonal of the rank matrix. Let $s_{i,j}$ denote the sum of the rank values in a segment and $a_{i,j} = (j - i + 1)^2$ be the inside area. $B = \{b_1, \ldots, b_m\}$ is a list of m coherent text segments. $s_k$ and $a_k$ refer to the sum of rank and area of segment k in B. D is the inside density of B (see equation 4).

\[ D = \frac{\sum_{k=1}^{m} s_k}{\sum_{k=1}^{m} a_k} \tag{4} \]

To initialise the process, the entire document is placed in B as one coherent text segment. Each step of the process splits one of the segments in B. The split point is a potential boundary which maximises D. Figure 5 shows a working example.
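
The splitting loop described above can be sketched as below. This is a deliberately naive illustration under stated assumptions: the names `region_sum` and `divisive_clustering` are hypothetical, every sentence break is treated as a potential boundary, region sums are recomputed from scratch instead of being pre-computed, and ties are broken in favour of the first candidate found.

```python
def region_sum(rank, i, j):
    """Sum of rank values in the square region for sentences i..j inclusive."""
    return sum(rank[p][q] for p in range(i, j + 1) for q in range(i, j + 1))


def divisive_clustering(rank, steps):
    """Greedy divisive clustering that maximises the inside density D.

    Returns (boundaries, densities). A boundary b separates sentence b - 1
    from sentence b; densities[t] is D after t splits."""
    n = len(rank)
    segments = [(0, n - 1)]              # the whole document is one segment
    total_s = region_sum(rank, 0, n - 1)
    total_a = n * n
    boundaries = []
    densities = [total_s / total_a]
    for _ in range(steps):
        best = None
        for idx, (i, j) in enumerate(segments):
            s_old = region_sum(rank, i, j)
            a_old = (j - i + 1) ** 2
            for b in range(i + 1, j + 1):    # candidate split points
                s_new = region_sum(rank, i, b - 1) + region_sum(rank, b, j)
                a_new = (b - i) ** 2 + (j - b + 1) ** 2
                d = (total_s - s_old + s_new) / (total_a - a_old + a_new)
                if best is None or d > best[0]:
                    best = (d, idx, b, s_new - s_old, a_new - a_old)
        if best is None:                     # every segment is one sentence
            break
        d, idx, b, ds, da = best
        i, j = segments[idx]
        segments[idx:idx + 1] = [(i, b - 1), (b, j)]
        total_s += ds
        total_a += da
        boundaries.append(b)
        densities.append(d)
    return boundaries, densities
```

Boundaries are returned in the order they were discovered, so the first few entries are the most prominent splits.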

The number of segments to generate, m, is determined automatically. $D^{(n)}$ is the inside density of n segments and $\delta D^{(n)} = D^{(n)} - D^{(n-1)}$ is the gradient. For a document with b potential boundaries, b steps of divisive clustering generates $\{D^{(1)}, \ldots, D^{(b+1)}\}$ and $\{\delta D^{(2)}, \ldots, \delta D^{(b+1)}\}$ (see figures 6 and 7). An unusually large reduction in $\delta D$ suggests the optimal clustering has been obtained (footnote 3) (see n = 10 in figure 7). Let $\mu$ and $v$ be the mean and variance of $\delta D^{(n)}$, $n \in \{2, \ldots, b + 1\}$. m is obtained by applying the threshold, $\mu + c \times \sqrt{v}$, to $\delta D$ (c = 1.2 works well in practice).

3 In practice, convolution (mask {1,2,4,8,4,2,1}) is first applied to $\delta D$ to smooth out sharp local changes.

Figure 5: A working example of the divisive clustering algorithm.

Figure 6: The inside density of all levels of segmentation.
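
One possible reading of the automatic termination rule is sketched below. The text does not spell out how the threshold is applied, so this sketch makes assumptions that should be treated as such: m is taken to be the largest n whose gradient exceeds the threshold, the variance is the population variance, and the smoothing convolution of footnote 3 is omitted. The name `choose_num_segments` is hypothetical.

```python
import math


def choose_num_segments(densities, c=1.2):
    """densities[t] is the inside density D for t + 1 segments."""
    # gradient[t] corresponds to delta D for n = t + 2 segments.
    gradient = [after - before for before, after in zip(densities, densities[1:])]
    if not gradient:
        return 1
    mean = sum(gradient) / len(gradient)
    variance = sum((g - mean) ** 2 for g in gradient) / len(gradient)
    threshold = mean + c * math.sqrt(variance)
    m = 1
    for t, g in enumerate(gradient):
        if g > threshold:
            m = t + 2  # assumed reading: keep the last unusually large gain
    return m
```

With densities 0.5, 0.55, 0.9, 0.91, 0.92, 0.925 only the jump from two to three segments exceeds the threshold, so the sketch stops at three segments.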

3.4 Speed optimisation

The running time of each step is dominated by the computation of $s_k$. Given $s_{i,j}$ is constant, our algorithm pre-computes all the values to improve speed performance. The procedure computes the values along diagonals, starting from the main diagonal and works towards the corner. The method has a complexity of order $1\frac{1}{2}n^2$. Let $r_{i,j}$ refer to the rank value in the rank matrix R and S to the sum of rank matrix. Given R of size n x n, S is computed in three steps (see equation 5). Figure 8 shows the result of applying this procedure to the rank matrix in figure 5.

Figure 7: Finding the optimal segmentation using the gradient.

4 Evaluation

The definition of a topic segment ranges from complete stories (Allan et al., 1998) to summaries (Ponte and Croft, 1997). Given the quality of an algorithm is task dependent, the following experiments focus on the relative performance. Our evaluation strategy is a variant of that described in (Reynar, 1998, 71-73) and the TDT segmentation task (Allan et al., 1998). We assume a good algorithm is one that finds the most prominent topic boundaries.

4.1 Experiment procedure 

An artificial test corpus of 700 samples is used to assess the accuracy and speed performance of segmentation algorithms. A sample is a concatenation of ten text segments. A segment is the first n sentences of a randomly selected document from the Brown corpus (footnote 4). A sample is characterised by the range of n. The corpus was generated by an automatic procedure (footnote 5). Table 1 presents the corpus statistics.



Range of n    3-11    3-5     6-8     9-11
# samples     400     100     100     100

Table 1: Test corpus statistics.

Figure 8: Improving speed performance by pre-computing $s_{i,j}$.
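
The pre-computation of section 3.4 can be illustrated with a diagonal-by-diagonal recurrence. Equation 5 is not legible in this copy, so the sketch below does not claim to reproduce it: it assumes a standard inclusion-exclusion recurrence over square regions, and the name `precompute_sums` is hypothetical.

```python
def precompute_sums(rank):
    """Return s where s[i][j] (i <= j) is the sum of rank values in the
    square region covering sentences i..j inclusive."""
    n = len(rank)
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        s[i][i] = rank[i][i]              # main diagonal: single sentences
    for span in range(1, n):              # move one diagonal towards the corner
        for i in range(n - span):
            j = i + span
            inner = s[i + 1][j - 1] if span > 1 else 0.0
            # Two overlapping sub-squares, minus their overlap, plus the
            # two corner elements that neither sub-square covers.
            s[i][j] = (s[i][j - 1] + s[i + 1][j] - inner
                       + rank[i][j] + rank[j][i])
    return s
```

Each entry is obtained in constant time from entries on earlier diagonals, so the table costs on the order of n squared operations, after which any segment sum is a single lookup.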



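\[ p(\text{error} \mid \text{ref}, \text{hyp}, k) = p(\text{miss} \mid \text{ref}, \text{hyp}, \text{diff}, k)\, p(\text{diff} \mid \text{ref}, k) + p(\text{fa} \mid \text{ref}, \text{hyp}, \text{same}, k)\, p(\text{same} \mid \text{ref}, k) \tag{6} \]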



Speed performance is measured by the average number of CPU seconds required to process a test sample (footnote 6). Segmentation accuracy is measured by the error metric (equation 6, fa → false alarms) proposed in (Beeferman et al., 1999). Low error probability indicates high accuracy. Other performance measures include the popular precision and recall metric (PR) (Hearst, 1994), fuzzy PR (Reynar, 1998) and edit distance (Ponte and Croft, 1997). The problems associated with these metrics are discussed in (Beeferman et al., 1999).
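
A windowed error probability of this kind can be computed as sketched below. This is an illustration and not the evaluation script used in the experiments: the name `pk` is hypothetical, segmentations are assumed to be given as lists of segment lengths in sentences, and the default probe width k (half the mean reference segment length) is an assumed convention.

```python
def pk(ref, hyp, k=None):
    """Fraction of sentence pairs (i, i + k) on which the reference and the
    hypothesis disagree about 'same segment' versus 'different segment'."""
    def labels(lengths):
        # Expand segment lengths into one segment id per sentence.
        return [seg for seg, size in enumerate(lengths) for _ in range(size)]

    r, h = labels(ref), labels(hyp)
    if len(r) != len(h):
        raise ValueError("reference and hypothesis cover different lengths")
    if k is None:
        k = max(1, round(len(r) / len(ref) / 2))  # assumed default
    windows = len(r) - k
    if windows <= 0:
        raise ValueError("probe width k is too large for this sample")
    errors = sum(
        (r[i] == r[i + k]) != (h[i] == h[i + k]) for i in range(windows)
    )
    return errors / windows
```

A miss is a pair that the reference places in different segments but the hypothesis places in the same one; a false alarm is the reverse. Both are counted by the same inequality test.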

4.2 Experiment 1 - Baseline 

Five degenerate algorithms define the baseline for the experiments. B_n does not propose any boundaries. B_a reports all potential boundaries as real boundaries. B_e partitions the sample into regular segments. B_(r,?) randomly selects any number of boundaries as real boundaries. B_(r,b) randomly selects b boundaries as real boundaries.



4 Only the news articles ca**.pos and informative text cj**.pos were used in the experiment.

5 All experiment data, algorithms, scripts and detailed results are available from the author.

6 All experiments were conducted on a Pentium II 266MHz PC with 128Mb RAM running RedHat Linux 6.0 and the Blackdown Linux port of JDK1.1.7 v3.



The accuracy of the last two algorithms is computed analytically. We consider the status of m potential boundaries as a bit string (1 → topic boundary). The terms p(miss) and p(fa) in equation 6 correspond to p(same|k) and p(diff|k) = 1 - p(same|k). Equations 7, 8 and 9 give the general form of p(same|k) and its value for B_(r,?) and B_(r,b), respectively (footnote 7).

Table 2 presents the experimental results. The values in rows two and three, and in rows four and five, are not actually the same. However, their differences are insignificant according to the Kolmogorov-Smirnov, or KS-test (Press et al., 1992).



\[ p(\text{same} \mid k) = \frac{\#\ \text{valid segmentations}}{\#\ \text{possible segmentations}} \tag{7} \]

\[ p(\text{same} \mid k, B_{(r,?)}) = \ldots \tag{8} \]

\[ p(\text{same} \mid k, m, B_{(r,b)}) = \ldots, \qquad \binom{x}{y} = \frac{x!}{y!\,(x - y)!} \tag{9} \]

            3-11    3-5     6-8     9-11
B_e         45%     38%     39%     36%
B_n         47%     47%     47%     46%
B_(r,b)     47%     47%     47%     46%
B_a         53%     53%     53%     54%
B_(r,?)     53%     53%     53%     54%

Table 2: The error rate of the baseline algorithms.
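
The right-hand sides of equations 8 and 9 can be worked out from equation 7 by counting bit strings. The derivation below is a reconstruction offered under assumptions, not a quotation: it assumes a probe of width k spans k consecutive potential boundaries, that the two probed sentences are in the same segment exactly when all k spanned bits are 0, and that every admissible bit string is equally likely.

```latex
% B_(r,?): all 2^m bit strings are possible; the k spanned bits must be 0.
\[ p(\text{same} \mid k, B_{(r,?)}) = \frac{2^{m-k}}{2^{m}} = 2^{-k} \]

% B_(r,b): exactly b of the m bits are 1; all b must fall outside the probe.
\[ p(\text{same} \mid k, m, B_{(r,b)}) = \frac{\binom{m-k}{b}}{\binom{m}{b}},
   \qquad \binom{x}{y} = \frac{x!}{y!\,(x-y)!} \]
```

Under these assumptions the first expression does not depend on m at all, while the second falls to zero once b exceeds m - k, since the boundaries can no longer all avoid the probe.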

4.3 Experiment 2 - TextTiling 

We compare three versions of the TextTiling algorithm (Hearst, 1994). H94(c,d) is Hearst's C implementation with default parameters. H94(c,r) uses the recommended parameters k = 6, w = 20. H94(j,r) is my implementation of the algorithm. Experimental result (table 3) shows H94(c,d) and H94(c,r) are more accurate than H94(j,r). We suspect this is due to the use of a different stopword list and stemming algorithm.



              3-11    3-5     6-8     9-11
Error rate
H94(c,d)      46%     44%     43%     48%
H94(c,r)      46%     44%     44%     49%
H94(j,r)      54%     45%     52%     53%
CPU time
H94(c,d)      0.67s   0.52s   0.66s   0.88s
H94(c,r)      0.68s   0.52s   0.67s   0.92s
H94(j,r)      3.77s   2.21s   3.69s   5.07s

Table 3: The error rate and speed performance of TextTiling.

7 The full derivation of our method is available from the author.

4.4 Experiment 3 - DotPlot

Five versions of Reynar's optimisation algorithm (Reynar, 1998) were evaluated. R98 and R98(min) are exact implementations of his maximisation and minimisation algorithm. R98(s,cos) is my version of the maximisation algorithm which uses the cosine coefficient instead of dot density for measuring similarity. It incorporates the optimisations described in section 3.4. R98(m,dot) is the modularised version of R98 for experimenting with different similarity measures.

R98(m,sa) uses a variant of Kozima's semantic similarity measure (Kozima, 1993) to compute block similarity. Word similarity is a function of word co-occurrence statistics in the given document. Words that belong to the same sentence are considered to be related. Given the co-occurrence frequencies $f(w_i, w_j)$, the transition probability matrix t is computed by equation 10. Equation 11 defines our spread activation scheme. s denotes the word similarity matrix, x is the number of activation steps and norm(y) converts a matrix y into a transition matrix. x = 5 was used in the experiment.

\[ t_{i,j} = p(w_j \mid w_i) = \frac{f(w_i, w_j)}{\sum_{k} f(w_i, w_k)} \tag{10} \]

\[ s = \mathrm{norm}(\ldots) \tag{11} \]
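
A spread activation scheme of this shape can be sketched as follows. The argument of norm in equation 11 is not legible in this copy, so the sketch assumes one natural choice, the sum of the first x powers of the transition matrix; that choice, the row-wise normalisation and the names `normalise`, `matmul` and `word_similarity` are all assumptions made for illustration.

```python
def normalise(matrix):
    """Scale each row to sum to 1 so the matrix is a transition matrix."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([value / total if total else 0.0 for value in row])
    return result


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def word_similarity(cooccurrence, steps=5):
    """Spread activation over word co-occurrence counts."""
    t = normalise(cooccurrence)          # equation 10
    power = t
    total = [row[:] for row in t]
    for _ in range(steps - 1):
        power = matmul(power, t)         # activation after one more step
        total = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(total, power)]
    return normalise(total)              # assumed form of equation 11
```

The effect is that two words which never share a sentence can still receive a non-zero similarity if they are linked through intermediate words.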



Experimental result (table 4) shows the cosine coefficient and our spread activation method improved segmentation accuracy. The speed optimisations significantly reduced the execution time.





              3-11    3-5     6-8     9-11
Error rate
R98(m,sa)     18%     20%     15%     12%
R98(s,cos)    21%     18%     19%     18%
R98(m,dot)    22%     21%     18%     16%
R98           22%     21%     18%     16%
R98(min)      n/a     34%     37%     37%
CPU time
R98(s,cos)    4.54s   2.24s   4.36s   6.99s
R98           29.58s  9.29s   28.09s  55.03s
R98(m,sa)     41.02s  7.34s   40.05s  113.5s
R98(m,dot)    46.58s  9.24s   42.72s  115.4s
R98(min)      n/a     19.62s  58.77s  122.6s

Table 4: The error rate and speed performance of Reynar's optimisation algorithm.

4.5 Experiment 4 - Segmenter 



We compare three versions of Segmenter (Kan et al., 1998). K98(p) is the original Perl implementation of the algorithm (version 1.6). K98(j) is my implementation of the algorithm. K98(j,a) is a version of K98(j) which uses a document specific chain breaking strategy. The distribution of link distances is used to identify unusually long links. The threshold is a function $\mu + c \times \sqrt{v}$ of the mean $\mu$ and variance $v$. We found c = 1 works well in practice.

Table 5 summarises the experimental results. K98(p) performed significantly better than K98(j,*). This is due to the use of a different part-of-speech tagger and shallow parser. The difference in speed is largely due to the programming languages and term clustering strategies. Our chain breaking strategy improved accuracy (compare K98(j) with K98(j,a)).





              3-11    3-5     6-8     9-11
Error rate
K98(p)        36%     23%     33%     43%
K98(j,a)      n/a     41%     46%     50%
K98(j)        n/a     44%     48%     51%
CPU time
K98(p)        4.24s   2.57s   4.21s   6.00s
K98(j)        n/a     21.43s  65.54s  129.3s
K98(j,a)      n/a     21.44s  65.49s  129.7s

Table 5: The error rate and speed performance of Segmenter.

4.6 Experiment 5 - Our algorithm, C99 

Two versions of our algorithm were developed, C99 and C99(b). The former is an exact implementation of the algorithm described in this paper. The latter is given the expected number of topic segments for fair comparison with R98. Both algorithms used an 11 x 11 ranking mask.

The first experiment focuses on the impact of our automatic termination strategy on C99(b) (table 6). C99(b) is marginally more accurate than C99. This indicates our automatic termination strategy is effective but not optimal. The minor reduction in speed performance is acceptable.





              3-11    3-5     6-8     9-11
Error rate
C99(b)        12%     12%     9%      9%
C99           13%     18%     10%     10%
CPU time
C99(b)        4.00s   1.91s   3.73s   5.99s
C99           4.04s   2.12s   4.04s   6.31s

Table 6: The error rate and speed performance of our algorithm, C99.

The second experiment investigates the effect of different ranking mask sizes on the performance of C99 (table 7). Execution time increases with mask size. A 1 x 1 ranking mask reduces all the elements in the rank matrix to zero. Interestingly, the increase in ranking mask size beyond 3 x 3 has an insignificant effect on segmentation accuracy. This suggests the use of extrema for clustering has a greater impact on accuracy than linearising the similarity scores (figure 4).




              3-11    3-5     6-8     9-11
Error rate
1 x 1         48%     48%     50%     49%
3 x 3         12%     11%     10%     8%
5 x 5         12%     11%     10%     8%
7 x 7         12%     11%     10%     8%
9 x 9         12%     11%     10%     9%
11 x 11       12%     11%     10%     9%
13 x 13       12%     11%     10%     9%
15 x 15       12%     11%     10%     9%
17 x 17       12%     10%     10%     8%
CPU time
1 x 1         3.92s   2.06s   3.84s   5.91s
3 x 3         3.83s   2.03s   3.79s   5.85s
5 x 5         3.86s   2.04s   3.84s   5.92s
7 x 7         3.90s   2.06s   3.88s   6.00s
9 x 9         3.96s   2.07s   3.92s   6.12s
11 x 11       4.02s   2.09s   3.98s   6.26s
13 x 13       4.11s   2.11s   4.07s   6.41s
15 x 15       4.20s   2.14s   4.14s   6.60s
17 x 17       4.29s   2.17s   4.25s   6.79s

Table 7: The impact of mask size on the performance of C99.

4.7 Summary 

Experimental result (table 8) shows our algorithm C99 is more accurate than existing algorithms. A two-fold increase in accuracy and seven-fold increase in speed was achieved (compare C99(b) with R98). If one disregards segmentation accuracy, H94 has the best algorithmic performance (linear). C99, K98 and R98 are all polynomial time algorithms. The significance of our results has been confirmed by both t-test and KS-test.





              3-11    3-5     6-8     9-11
Error rate
C99(b)        12%     12%     9%      9%
C99           13%     18%     10%     10%
R98           22%     21%     18%     16%
K98(p)        36%     23%     33%     43%
H94(c,d)      46%     44%     43%     48%
CPU time
H94(j,r)      3.77s   2.21s   3.69s   5.07s
C99(b)        4.00s   1.91s   3.73s   5.99s
C99           4.04s   2.12s   4.04s   6.31s
R98           29.58s  9.29s   28.09s  55.03s
K98(j)        n/a     21.43s  65.54s  129.3s

Table 8: A summary of our experimental results.



5 Conclusions and future work 

A segmentation algorithm has two key elements, a clustering strategy and a similarity measure. Our results show divisive clustering (R98) is more precise than sliding window (H94) and lexical chains (K98) for locating topic boundaries.

Four similarity measures were examined. The cosine coefficient (R98(s,cos)) and dot density measure (R98(m,dot)) yield similar results. Our spread activation based semantic measure (R98(m,sa)) improved accuracy. This confirms that although Kozima's approach (Kozima, 1993) is computationally expensive, it does produce more precise segmentation.

The most significant improvement was due to our 
ranking scheme which linearises the cosine coeffi- 
cient. Our experiments demonstrate that given in- 
sufficient data, the qualitative behaviour of the co- 
sine measure is indeed more reliable than the actual 
values. 

Although our evaluation scheme is sufficient for this comparative study, further research requires a large scale, task independent benchmark. It would be interesting to compare C99 with the multi-source method described in (Beeferman et al., 1999) using the TDT corpus. We would also like to develop a linear time and multi-source version of the algorithm.